APANPS5420 Assignment 1 - Modern Time Series Assignment¶
Rake Anggoro (ma4716)¶
Executive Summary
This time series analysis project investigates patterns and forecasts of monthly U.S. flight arrival delays using publicly available data from top airline carriers and airports. The objective is to understand delay trends, identify seasonal and external disruptors, and evaluate the forecasting performance of several time series models.
Key Steps and Highlights:
Data Preparation: Cleaned and structured 400K+ flight performance records by month, carrier, and airport from 2005 to 2025. Supplementary features were engineered, including holiday counts, travel season flags, and macro-disruptions (COVID-19, natural disasters, economic shocks).
Exploratory Analysis: Visualized and quantified delays by month, season, carrier, and airport. Key patterns revealed spikes in delay rates during major holidays and notable disruptions in 2020 (COVID-19 impact).
Modeling Approaches Tested:
- Simple Moving Average (SMA): Captured basic seasonality and provided a baseline forecast.
- Exponential Moving Average (EMA): Introduced more responsiveness to recent trends.
- Seasonal-Trend Decomposition (STD): Effectively separated trend and seasonal components, showing high accuracy.
- Facebook Prophet: Applied with and without external regressors (e.g., holidays, disasters). It showed improvements when enriched with features but still underperformed compared to STD in this context.
Performance Summary:
| Model | MAPE (%) | MSE | % Points Outside Bounds |
|---|---|---|---|
| SMA | 28.66 | 1.09×10¹² | 31.09% |
| EMA | 36.84 | 1.47×10¹² | 37.60% |
| STD | 21.00 | 4.62×10¹¹ | 10.87% |
| Prophet (default) | 39.40 | 1.15×10¹² | 14.05% |
| Prophet (w/ regressors) | 28.12 | 7.11×10¹¹ | 17.77% |
Key Takeaways:
- STD outperformed more complex models in both accuracy and robustness for in-sample forecasting.
- Prophet’s performance improved significantly with external features, though still not surpassing STD for this dataset.
- Major external events (COVID-19, hurricanes, economic disruptions) play a clear role in delay patterns and should be included in any serious forecasting pipeline.
This project demonstrates the power of combining domain-driven feature engineering with a comparative approach to time series modeling, enabling more informed operational planning in the aviation sector.
Load Packages¶
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"
import plotly.graph_objects as go
from statsmodels.tsa.api import SimpleExpSmoothing
import statsmodels.api as sm
from plotly.subplots import make_subplots
from pandas.tseries.holiday import USFederalHolidayCalendar
import numpy as np
np.float_ = np.float64  # avoid Prophet error: np.float_ was removed in the NumPy 2.0 release
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
Set Config¶
data_dir = "/Users/mrla/Documents/Projects/data/flight_delays/ot_delaycause1_DL"
pd.set_option('display.max_columns', None)
# Use Seaborn's white theme with bold axes and soft gridlines
sns.set_theme(style="ticks", context="talk", palette="muted")
def plot_line(data, x, y, title, ylabel, figsize=(14, 6), hue=None):
plt.figure(figsize=figsize)
ax = sns.lineplot(data=data, x=x, y=y, hue=hue, linewidth=2.5)
ax.set_title(title, fontsize=18, fontweight='bold')
ax.set_ylabel(ylabel)
ax.set_xlabel('')
ax.tick_params(axis='x', rotation=45)
sns.despine() # removes top and right spines
plt.tight_layout()
plt.show()
Load Data¶
This dataset, published by the U.S. Department of Transportation’s Bureau of Transportation Statistics (BTS), contains airline on-time performance and delay cause metrics from June 2003 through February 2025. It includes detailed breakdowns of delays by category (carrier, weather, NAS (National Aviation System), security, and late-arriving aircraft), alongside data on cancellations and diversions. The information is publicly available via the BTS TranStats portal and supports analysis of trends in U.S. aviation operations.
df = pd.read_csv(data_dir + "/Airline_Delay_Cause.csv")
Data Definitions (Source)¶
| Column Name | Column Definition |
|---|---|
| year | YYYY format |
| month | MM format (1-12) |
| carrier | Code assigned by US DOT to identify a unique airline carrier. |
| carrier_name | Unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation. |
| airport | A three-character alpha-numeric code issued by the U.S. Department of Transportation, which is the official designation of the airport. |
| airport_name | A place from which aircraft operate that usually has paved runways and maintenance facilities and often serves as a terminal. |
| arr_flights | Arrival Flights |
| arr_del15 | Arrival Delay Indicator (15 Minutes or More). Arrival delay equals the difference between the actual arrival time and the scheduled arrival time. A flight is considered on-time when it arrives less than 15 minutes after its published arrival time. |
| carrier_ct | Carrier Count for airline cause of delay |
| weather_ct | Weather Count for airline cause of delay |
| nas_ct | NAS (National Air System) Count for airline cause of delay |
| security_ct | Security Count for airline cause of delay |
| late_aircraft_ct | Late Aircraft Delay Count for airline cause of delay |
| arr_cancelled | Flight cancelled |
| arr_diverted | Flight diverted |
| arr_delay | Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers. |
| carrier_delay | Carrier Delay, in Minutes |
| weather_delay | Weather Delay, in Minutes |
| nas_delay | National Air System Delay, in Minutes |
| security_delay | Security Delay, in Minutes |
| late_aircraft_delay | Late Aircraft Delay, in Minutes |
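For orientation, the relationship between `arr_flights` and `arr_del15` implied by these definitions can be illustrated on a toy frame (the values below are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy rows mimicking the schema above (illustrative values only)
toy = pd.DataFrame({
    "arr_flights": [100.0, 250.0],
    "arr_del15": [20.0, 50.0],
})

# A flight counts as delayed when it arrives 15+ minutes late,
# so the delay rate is simply the ratio of the two columns.
toy["pct_delayed"] = toy["arr_del15"] / toy["arr_flights"]
print(toy["pct_delayed"].tolist())  # [0.2, 0.2]
```

This per-row rate is the quantity aggregated later as `pct_delayed` in the yearly and monthly summaries.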
Data Exploration¶
High-level Data Summary¶
print(f"Shape of data: {df.shape[0]:,} rows, {df.shape[1]:,} columns")
print(f"Columns in data: {df.columns.tolist()}")
print(f"First 5 rows of data:\n{df.head()}")
Shape of data: 400,118 rows, 21 columns
Columns in data: ['year', 'month', 'carrier', 'carrier_name', 'airport', 'airport_name', 'arr_flights', 'arr_del15', 'carrier_ct', 'weather_ct', 'nas_ct', 'security_ct', 'late_aircraft_ct', 'arr_cancelled', 'arr_diverted', 'arr_delay', 'carrier_delay', 'weather_delay', 'nas_delay', 'security_delay', 'late_aircraft_delay']
First 5 rows of data:
year month carrier carrier_name airport \
0 2025 2 9E Endeavor Air Inc. ABE
1 2025 2 9E Endeavor Air Inc. AEX
2 2025 2 9E Endeavor Air Inc. AGS
3 2025 2 9E Endeavor Air Inc. ALB
4 2025 2 9E Endeavor Air Inc. ATL
airport_name arr_flights arr_del15 \
0 Allentown/Bethlehem/Easton, PA: Lehigh Valley ... 78.0 9.0
1 Alexandria, LA: Alexandria International 78.0 12.0
2 Augusta, GA: Augusta Regional at Bush Field 91.0 13.0
3 Albany, NY: Albany International 56.0 12.0
4 Atlanta, GA: Hartsfield-Jackson Atlanta Intern... 2700.0 416.0
carrier_ct weather_ct nas_ct security_ct late_aircraft_ct \
0 5.52 0.52 2.84 0.00 0.12
1 5.77 1.62 1.75 0.00 2.86
2 2.47 0.93 4.25 0.00 5.35
3 4.34 0.34 5.32 0.00 2.00
4 83.87 16.73 143.26 0.16 171.98
arr_cancelled arr_diverted arr_delay carrier_delay weather_delay \
0 0.0 0.0 733.0 578.0 16.0
1 0.0 0.0 803.0 379.0 75.0
2 0.0 0.0 964.0 101.0 507.0
3 2.0 1.0 761.0 246.0 35.0
4 24.0 3.0 33668.0 10723.0 1790.0
nas_delay security_delay late_aircraft_delay
0 102.0 0.0 37.0
1 92.0 0.0 257.0
2 197.0 0.0 159.0
3 239.0 0.0 241.0
4 6851.0 15.0 14289.0
print(f"Data types of columns:\n{df.dtypes}")
print("--------------------------------")
missing_counts = df.isnull().sum()
missing_percent = (missing_counts / len(df)) * 100
missing_df = pd.DataFrame({
'Missing Count': missing_counts,
'Missing %': missing_percent.round(2)
})
print("Missing values in each column:\n", missing_df)
print("--------------------------------")
print(f"Summary statistics of numerical columns:\n{df.describe()}")
Data types of columns:
year int64
month int64
carrier object
carrier_name object
airport object
airport_name object
arr_flights float64
arr_del15 float64
carrier_ct float64
weather_ct float64
nas_ct float64
security_ct float64
late_aircraft_ct float64
arr_cancelled float64
arr_diverted float64
arr_delay float64
carrier_delay float64
weather_delay float64
nas_delay float64
security_delay float64
late_aircraft_delay float64
dtype: object
--------------------------------
Missing values in each column:
Missing Count Missing %
year 0 0.00
month 0 0.00
carrier 0 0.00
carrier_name 0 0.00
airport 0 0.00
airport_name 0 0.00
arr_flights 657 0.16
arr_del15 951 0.24
carrier_ct 657 0.16
weather_ct 657 0.16
nas_ct 657 0.16
security_ct 657 0.16
late_aircraft_ct 657 0.16
arr_cancelled 657 0.16
arr_diverted 657 0.16
arr_delay 657 0.16
carrier_delay 657 0.16
weather_delay 657 0.16
nas_delay 657 0.16
security_delay 657 0.16
late_aircraft_delay 657 0.16
--------------------------------
Summary statistics of numerical columns:
year month arr_flights arr_del15 \
count 400118.000000 400118.000000 399461.000000 399167.000000
mean 2014.471181 6.502764 361.357727 69.552861
std 6.499727 3.468805 993.307701 193.465420
min 2003.000000 1.000000 1.000000 0.000000
25% 2009.000000 3.000000 55.000000 8.000000
50% 2015.000000 7.000000 113.000000 21.000000
75% 2020.000000 10.000000 254.000000 52.000000
max 2025.000000 12.000000 21977.000000 6377.000000
carrier_ct weather_ct nas_ct security_ct \
count 399461.000000 399461.000000 399461.000000 399461.000000
mean 20.529071 2.508470 22.195918 0.172298
std 48.302099 9.566042 79.382261 0.824913
min 0.000000 0.000000 -0.010000 0.000000
25% 2.780000 0.000000 1.400000 0.000000
50% 7.500000 0.510000 4.910000 0.000000
75% 18.630000 2.000000 13.980000 0.000000
max 1886.580000 717.940000 4091.270000 80.560000
late_aircraft_ct arr_cancelled arr_diverted arr_delay \
count 399461.000000 399461.000000 399461.000000 399461.000000
mean 24.095947 6.778667 0.833108 4168.764791
std 74.351714 35.028071 3.786252 12756.159221
min 0.000000 0.000000 0.000000 0.000000
25% 1.520000 0.000000 0.000000 405.000000
50% 5.490000 1.000000 0.000000 1140.000000
75% 16.140000 4.000000 1.000000 2974.000000
max 2588.130000 4951.000000 256.000000 648300.000000
carrier_delay weather_delay nas_delay security_delay \
count 399461.000000 399461.000000 399461.000000 399461.000000
mean 1310.001404 223.720794 1025.811278 7.133262
std 3829.563198 884.316576 4330.445982 39.070721
min 0.000000 0.000000 -19.000000 0.000000
25% 134.000000 0.000000 49.000000 0.000000
50% 410.000000 22.000000 183.000000 0.000000
75% 1075.000000 156.000000 555.000000 0.000000
max 321792.000000 64550.000000 238440.000000 3760.000000
late_aircraft_delay
count 399461.000000
mean 1602.090805
std 5154.067020
min 0.000000
25% 76.000000
50% 342.000000
75% 1089.000000
max 279153.000000
Insights so far
Understanding *_ct and *_delay Columns
🔹 *_ct Columns (e.g., carrier_ct, weather_ct)
Despite the name, these are not raw integer counts of delayed flights: they contain decimal values, which indicates that each delayed flight is proportionally allocated across its contributing causes (a flight delayed by both weather and a late-arriving aircraft is split between the two categories).
In the sample rows above, the five *_ct components sum to arr_del15 (e.g., 5.52 + 0.52 + 2.84 + 0.00 + 0.12 = 9.00 for the first row), so they can be read as fractional shares of the delayed-flight total rather than as independent counts.
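A quick standalone check of how the *_ct fields relate to arr_del15, using values copied from row 0 of the df.head() output above (Endeavor Air at ABE):

```python
# Values copied from the first row of df.head() shown earlier
row = {"arr_del15": 9.0, "carrier_ct": 5.52, "weather_ct": 0.52,
       "nas_ct": 2.84, "security_ct": 0.00, "late_aircraft_ct": 0.12}

# Sum the fractional cause counts and compare with the delayed-flight total
ct_sum = (row["carrier_ct"] + row["weather_ct"] + row["nas_ct"]
          + row["security_ct"] + row["late_aircraft_ct"])
print(round(ct_sum, 2), row["arr_del15"])  # 9.0 9.0
```

For this row the components add back to arr_del15 exactly; in general small rounding differences should be expected.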
🔹 *_delay Columns (e.g., carrier_delay, weather_delay)
These represent delay durations in minutes. In principle, their sum should match arr_delay:
arr_delay ≈ carrier_delay + weather_delay + nas_delay + security_delay + late_aircraft_delay
In practice, small mismatches may exist due to:
- Rounding
- Missing values
- Unattributed or jointly caused delays
In the next code section, we check whether the delay components add up to arr_delay and explore any mismatches.
# Step 1: Safely compute delay component sum by treating NaNs as 0
delay_components = (
df['carrier_delay'].fillna(0) +
df['weather_delay'].fillna(0) +
df['nas_delay'].fillna(0) +
df['security_delay'].fillna(0) +
df['late_aircraft_delay'].fillna(0)
)
# Step 2: Compute difference from arr_delay (only where arr_delay is not null)
valid_mask = df['arr_delay'].notna()
delay_difference = df.loc[valid_mask, 'arr_delay'] - delay_components[valid_mask]
# Step 3: Identify rows where the difference is significant (e.g., > 1 minute)
mismatched_rows = df.loc[valid_mask].loc[delay_difference.abs() > 1]
# Step 4: Summary
print(f"Total valid rows (non-null arr_delay): {valid_mask.sum()}")
print(f"Rows where delay components do NOT sum to arr_delay: {len(mismatched_rows)} ({(len(mismatched_rows) / valid_mask.sum()) * 100:.2f}%)")
print("\nSample mismatches:")
print(
mismatched_rows[['arr_delay']].assign(
delay_components=delay_components[valid_mask],
difference=delay_difference
).head()
)
Total valid rows (non-null arr_delay): 399461
Rows where delay components do NOT sum to arr_delay: 9 (0.00%)
Sample mismatches:
arr_delay delay_components difference
13845 39334.0 39154.0 180.0
13848 12812.0 12686.0 126.0
25479 2773.0 1428.0 1345.0
28886 6934.0 6880.0 54.0
31638 3006.0 2775.0 231.0
Out of 399,461 valid records (rows where arr_delay is not null), only 9 rows (under 0.01%) show a mismatch of more than one minute between the sum of the individual delay components (carrier_delay, weather_delay, nas_delay, security_delay, late_aircraft_delay) and the reported arr_delay. This confirms that the delay decomposition is highly reliable across the dataset. The mismatches are small in scale; the largest among the sampled rows is 1,345 minutes. These discrepancies could be due to rounding, reporting errors, or rare cases of unclassified delays.
Overall, the dataset's delay attribution can be considered consistent and trustworthy for modeling time series and anomaly detection.
Yearly Stats¶
import pandas as pd
# Add a 'season' column to the original DataFrame
def get_season(month):
if month in [12, 1, 2]:
return 'Winter'
elif month in [3, 4, 5]:
return 'Spring'
elif month in [6, 7, 8]:
return 'Summer'
else:
return 'Fall'
df['season'] = df['month'].apply(get_season)
# Yearly aggregated DataFrame
yearly_df = (
df.groupby('year')
.agg(
n_airports=('airport', pd.Series.nunique),
n_carriers=('carrier', pd.Series.nunique),
total_arr_flights=('arr_flights', 'sum'),
pct_delayed=('arr_del15', lambda x: x.sum() / df.loc[x.index, 'arr_flights'].sum()),
avg_arr_delay=('arr_delay', 'mean'),
avg_carrier_delay=('carrier_delay', 'mean'),
avg_weather_delay=('weather_delay', 'mean'),
avg_nas_delay=('nas_delay', 'mean'),
avg_security_delay=('security_delay', 'mean'),
avg_late_aircraft_delay=('late_aircraft_delay', 'mean')
)
.reset_index()
)
# Seasonal aggregated DataFrame
seasonal_df = (
df.groupby(['year', 'season'])
.agg(
n_airports=('airport', pd.Series.nunique),
n_carriers=('carrier', pd.Series.nunique),
total_arr_flights=('arr_flights', 'sum'),
pct_delayed=('arr_del15', lambda x: x.sum() / df.loc[x.index, 'arr_flights'].sum()),
avg_arr_delay=('arr_delay', 'mean'),
avg_carrier_delay=('carrier_delay', 'mean'),
avg_weather_delay=('weather_delay', 'mean'),
avg_nas_delay=('nas_delay', 'mean'),
avg_security_delay=('security_delay', 'mean'),
avg_late_aircraft_delay=('late_aircraft_delay', 'mean')
)
.reset_index()
)
# Optional: Ensure natural season order
season_order = ['Winter', 'Spring', 'Summer', 'Fall']
seasonal_df['season'] = pd.Categorical(seasonal_df['season'], categories=season_order, ordered=True)
seasonal_df = seasonal_df.sort_values('season')
# Plot 1: Number of Flights
plot_line(yearly_df, x='year', y='total_arr_flights',
title='Number of Flights per Year', ylabel='Flights')
plot_line(seasonal_df, x='year', y='total_arr_flights', hue='season',
title='Number of Flights per Year by Season', ylabel='Flights')
# Plot 2: Number of Airports
plot_line(yearly_df, x='year', y='n_airports',
title='Number of Airports per Year', ylabel='Airports')
plot_line(seasonal_df, x='year', y='n_airports', hue='season',
title='Number of Airports per Year by Season', ylabel='Airports')
# Plot 3: Number of Carriers
plot_line(yearly_df, x='year', y='n_carriers',
title='Number of Carriers per Year', ylabel='Carriers')
plot_line(seasonal_df, x='year', y='n_carriers', hue='season',
title='Number of Carriers per Year by Season', ylabel='Carriers')
# Plot 4: Percentage of Flights Delayed
plot_line(yearly_df, x='year', y='pct_delayed',
title='Percentage of Flights Delayed per Year', ylabel='Percent Delayed')
plot_line(seasonal_df, x='year', y='pct_delayed', hue='season',
title='Percentage of Flights Delayed per Year by Season', ylabel='Percent Delayed')
# Plot 5: Average Arrival Delay
plot_line(yearly_df, x='year', y='avg_arr_delay',
title='Average Arrival Delay per Year', ylabel='Avg Delay (minutes)')
plot_line(seasonal_df, x='year', y='avg_arr_delay', hue='season',
title='Average Arrival Delay per Year by Season', ylabel='Avg Delay (minutes)')
Some insights so far:
Data prior to 2005
- The dataset is incomplete for years before 2005, with only a few carriers represented.
- Data prior to this will be excluded
Impact of Covid-19
- The data in 2020 seems to be heavily impacted by the Covid-19 pandemic, with significant drops in flight activity and delays.
- Data in 2023 seems to have returned to pre-pandemic levels, but with some lingering effects.
- Data for 2025 is incomplete (only January and February are available), so 2025 figures should be interpreted with caution.
One anomaly stands out: the number of airports and carriers spikes sharply in 2018. The spike seems too high and warrants further investigation.
Flights of Carriers per year¶
# Pivot table with carriers on rows and years on columns
carrier_year_matrix = (
df.groupby(['carrier', 'carrier_name', 'year'])
.size()
.reset_index(name='n_rows')
.pivot_table(
index=['carrier', 'carrier_name'],
columns='year',
values='n_rows',
fill_value=0
)
)
# Create a readable label: "XX – Carrier Name"
carrier_year_matrix.index = carrier_year_matrix.index.map(
lambda x: f"{x[0]} – {x[1]}"
)
# Create heatmap
plt.figure(figsize=(10, int(len(carrier_year_matrix) * 0.4)))
sns.heatmap(
carrier_year_matrix,
cmap="YlGnBu",
linewidths=0.5,
linecolor='gray',
cbar_kws={'label': 'Row Count',
'shrink': 0.5,
'aspect': 20,
},
square=False,
)
plt.title("Carrier Activity by Year", fontsize=16, fontweight='bold')
plt.xlabel("Year")
plt.ylabel("Carrier")
plt.tick_params(axis='y', labelsize=8)
plt.tick_params(axis='x', labelsize=8)
plt.tight_layout()
plt.show()
The dataset reveals several structural inconsistencies that should be considered before conducting trend analysis or forecasting:
Missing Historical Presence
- Some long-established carriers (e.g., Piedmont Airlines) only appear in recent years.
- This likely reflects changes in reporting scope or operational status (e.g., reactivation after mergers).
Carrier Code vs. Carrier Name Mismatch
The same carrier code is reused over time but tied to different carrier names.
- Example: "EV" appears as both ExpressJet Airlines Inc. and ExpressJet Airlines LLC.
This creates ambiguity in time series analysis, especially when aggregating or tracking carrier performance longitudinally.
Irregular or Short-Lived Carriers
- Some carriers operate only for limited periods or on seasonal routes (e.g., Endeavor Air Inc, Peninsula Airways, Mesa Airlines Inc.).
- These entries can distort long-term trends if not filtered or labeled appropriately.
For this analysis, we will:
- Combine flights reported under different carrier names into a single series per carrier code
- Focus only on carriers with a consistent presence across the dataset, e.g. Delta Airlines, JetBlue, United, etc.
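The first cleanup step can be sketched as follows on a toy frame (the "latest name wins" rule is an assumption chosen for illustration, not something the dataset mandates):

```python
import pandas as pd

# Toy frame with the "EV" code tied to two historical names (the pattern noted above)
toy = pd.DataFrame({
    "carrier": ["EV", "EV", "DL"],
    "carrier_name": ["ExpressJet Airlines Inc.", "ExpressJet Airlines LLC",
                     "Delta Air Lines Inc."],
    "year": [2010, 2018, 2018],
})

# Pick one canonical name per code -- here, the name from the most recent year
canonical = (
    toy.sort_values("year")
       .groupby("carrier")["carrier_name"]
       .last()
)
toy["carrier_name"] = toy["carrier"].map(canonical)
print(toy["carrier_name"].unique().tolist())
```

After this mapping, each carrier code corresponds to exactly one name, so longitudinal aggregation by carrier is unambiguous.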
Flights in Each Airport per Year¶
# Pivot table with airports on rows and years on columns
airport_year_matrix = (
df.groupby(['airport', 'airport_name', 'year'])
.size()
.reset_index(name='n_rows')
.pivot_table(
index=['airport', 'airport_name'],
columns='year',
values='n_rows',
fill_value=0
)
)
# Create a readable label: "XXX – Airport Name"
airport_year_matrix.index = airport_year_matrix.index.map(
lambda x: f"{x[0]} – {x[1]}"
)
# Transpose the matrix: now years = rows, airports = columns
airport_year_matrix = airport_year_matrix.T
fig = px.imshow(
airport_year_matrix,
labels=dict(x="Airport", y="Year", color="Row Count"),
aspect="auto",
color_continuous_scale="YlGnBu"
)
fig.update_layout(
width=3000, # adjust as needed
height=600,
title="Airport Activity by Year",
xaxis_tickangle=45,
xaxis=dict(
tickfont=dict(size=8) # smaller x-axis font
),
yaxis=dict(
tickfont=dict(size=10) # smaller y-axis font
)
)
fig.show()
Similar to the carrier data, the airport data also has some inconsistencies that should be considered before conducting trend analysis or forecasting:
- Airports such as PVD (Providence, RI) and LAS (Las Vegas, NV) appear under multiple airport names even though they refer to the same physical airport. In the case of LAS, McCarran International Airport was renamed Harry Reid International Airport, so both names appear in the data.
For this analysis, we will only consider flights from airports with high flight activity and a consistent presence across the dataset, e.g. LAS, LGA, MCI, etc.
Note: The Plotly chart above is zoomed to showcase these issues in the airport data; use the zoom controls to explore other airports.
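An airport filter along these lines can be sketched as follows (toy data; the 90% presence threshold is an illustrative assumption, and the same logic applies to the real airport columns):

```python
import pandas as pd

# Toy monthly presence table: LGA appears in every month, XYZ in only one
toy = pd.DataFrame({
    "airport": ["LGA", "LGA", "LGA", "XYZ"],
    "year":    [2005, 2005, 2005, 2005],
    "month":   [1, 2, 3, 2],
})

# Months covered by the dataset as a whole
total_months = toy[["year", "month"]].drop_duplicates().shape[0]

# Months in which each airport reported at least one record
presence = toy.drop_duplicates(["airport", "year", "month"]).groupby("airport").size()

# Keep airports present in at least 90% of months
consistent = presence[presence / total_months >= 0.9].index.tolist()
print(consistent)  # ['LGA']
```

This mirrors the carrier-consistency calculation in the next section, applied to airports instead of carriers.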
Popular Carriers¶
To make this analysis more focused, we will only consider the carriers with the most consistent appearance in the dataset, i.e. the ones that appeared from the start until the end of the dataset.
# Step 1: Total number of (year, month) combinations
total_months = df[['year', 'month']].drop_duplicates().shape[0]
# Step 2: Find in which year/month each carrier is present
carrier_presence = df.groupby(['carrier', 'year', 'month'])['arr_flights'].sum().reset_index()
# Step 3: Count active months for each carrier
carrier_months_present = (
carrier_presence.groupby(['carrier'])
.size()
.reset_index(name='active_months')
)
# Step 4: Calculate consistency (% of total months)
carrier_months_present['consistency_pct'] = (
carrier_months_present['active_months'] / total_months * 100
)
# Step 5: Total number of flights per carrier
total_flights = (
df.groupby(['carrier'])['arr_flights']
.sum()
.reset_index(name='total_flights')
)
# Step 6: Merge and sort
carrier_stats = carrier_months_present.merge(total_flights, on=['carrier'])
carrier_stats = carrier_stats.sort_values(by='total_flights', ascending=False).reset_index(drop=True)
# Optional: display top 10
print(carrier_stats.head(10))
  carrier  active_months  consistency_pct  total_flights
0      WN            261       100.000000     25921109.0
1      DL            261       100.000000     16368925.0
2      AA            261       100.000000     15571282.0
3      OO            261       100.000000     13666764.0
4      UA            261       100.000000     11249820.0
5      MQ            237        90.804598      7659466.0
6      EV            208        79.693487      6494029.0
7      US            145        55.555556      5188612.0
8      B6            261       100.000000      4738419.0
9      AS            261       100.000000      3901738.0
We can see the top 5 carriers with 100% flight activity in the dataset are:
| Carrier Code | Carrier Name |
|---|---|
| WN | Southwest Airlines |
| DL | Delta Air Lines |
| AA | American Airlines |
| OO | SkyWest Airlines |
| UA | United Airlines |
These carriers will be used for this analysis.
Data Wrangling: Filtering¶
We will filter the dataset to include only the following:
- Years: 2005 onwards
- Top 5 Carriers that appeared across all time periods: WN, DL, AA, OO, UA
df_clean = df[(df['carrier'].isin(['WN', 'DL', 'AA', 'OO', 'UA'])) & (df['year'] >= 2005)]
print(f"Original Rows {df.shape[0]:,}, Cleaned Rows {df_clean.shape[0]:,} ({(df_clean.shape[0] / df.shape[0]) * 100:.2f}% of original)")
Original Rows 400,118, Cleaned Rows 138,196 (34.54% of original)
Monthly stats¶
We can now start to explore the monthly stats of the dataset using the filtered dataset.
monthly_stats = (
df_clean.groupby(['year', 'month'])
.agg(
n_rows=('airport', 'count'),
n_airports=('airport', pd.Series.nunique),
n_carriers=('carrier', pd.Series.nunique),
total_arr_flights=('arr_flights', 'sum'),
total_arr_del15=('arr_del15', 'sum'),
total_arr_delay=('arr_delay', 'sum'),
avg_arr_delay=('arr_delay', 'mean'),
max_arr_delay=('arr_delay', 'max'),
avg_carrier_delay=('carrier_delay', 'mean'),
avg_weather_delay=('weather_delay', 'mean'),
avg_nas_delay=('nas_delay', 'mean'),
avg_security_delay=('security_delay', 'mean'),
avg_late_aircraft_delay=('late_aircraft_delay', 'mean'),
total_cancelled=('arr_cancelled', 'sum'),
total_diverted=('arr_diverted', 'sum')
)
.reset_index()
)
# Derived metrics
monthly_stats['pct_delayed'] = (
monthly_stats['total_arr_del15'] / monthly_stats['total_arr_flights']
)
monthly_stats['cancel_rate'] = (
monthly_stats['total_cancelled'] / monthly_stats['total_arr_flights']
)
monthly_stats['divert_rate'] = (
monthly_stats['total_diverted'] / monthly_stats['total_arr_flights']
)
# Create a datetime column for monthly x-axis
monthly_stats['date'] = pd.to_datetime(
monthly_stats[['year', 'month']].assign(day=1)
)
display(monthly_stats)
| | year | month | n_rows | n_airports | n_carriers | total_arr_flights | total_arr_del15 | total_arr_delay | avg_arr_delay | max_arr_delay | avg_carrier_delay | avg_weather_delay | avg_nas_delay | avg_security_delay | avg_late_aircraft_delay | total_cancelled | total_diverted | pct_delayed | cancel_rate | divert_rate | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2005 | 1 | 457 | 198 | 5 | 284514.0 | 66785.0 | 3600500.0 | 7878.555799 | 197239.0 | 2041.179431 | 589.183807 | 2236.919037 | 10.877462 | 3000.396061 | 10656.0 | 888.0 | 0.234734 | 0.037453 | 0.003121 | 2005-01-01 |
| 1 | 2005 | 2 | 453 | 196 | 5 | 261124.0 | 49000.0 | 2375416.0 | 5243.743929 | 221659.0 | 1366.306843 | 193.902870 | 1691.086093 | 6.485651 | 1985.962472 | 3491.0 | 368.0 | 0.187650 | 0.013369 | 0.001409 | 2005-02-01 |
| 2 | 2005 | 3 | 455 | 196 | 5 | 291972.0 | 56796.0 | 2957964.0 | 6501.019780 | 332144.0 | 1719.316484 | 255.615385 | 2000.903297 | 15.094505 | 2510.090110 | 3461.0 | 339.0 | 0.194526 | 0.011854 | 0.001161 | 2005-03-01 |
| 3 | 2005 | 4 | 462 | 197 | 5 | 280165.0 | 37835.0 | 1719410.0 | 3721.666667 | 169481.0 | 1154.965368 | 136.822511 | 1057.541126 | 7.158009 | 1365.179654 | 3167.0 | 306.0 | 0.135045 | 0.011304 | 0.001092 | 2005-04-01 |
| 4 | 2005 | 5 | 446 | 193 | 5 | 288952.0 | 40307.0 | 1995596.0 | 4474.430493 | 143180.0 | 1310.506726 | 168.795964 | 1265.957399 | 4.589686 | 1724.580717 | 2333.0 | 374.0 | 0.139494 | 0.008074 | 0.001294 | 2005-05-01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 237 | 2024 | 10 | 728 | 279 | 5 | 431823.0 | 55552.0 | 3374479.0 | 4635.273352 | 163335.0 | 2114.288462 | 159.078297 | 651.013736 | 7.344780 | 1703.548077 | 3770.0 | 503.0 | 0.128645 | 0.008730 | 0.001165 | 2024-10-01 |
| 238 | 2024 | 11 | 724 | 278 | 5 | 399791.0 | 58548.0 | 3722939.0 | 5142.180939 | 206051.0 | 2015.313536 | 310.917127 | 1035.198895 | 7.893646 | 1772.857735 | 2004.0 | 629.0 | 0.146447 | 0.005013 | 0.001573 | 2024-11-01 |
| 239 | 2024 | 12 | 729 | 276 | 5 | 407919.0 | 82360.0 | 5864330.0 | 8044.348422 | 303265.0 | 3021.779150 | 637.676269 | 1291.849108 | 10.969822 | 3082.074074 | 2629.0 | 1117.0 | 0.201903 | 0.006445 | 0.002738 | 2024-12-01 |
| 240 | 2025 | 1 | 721 | 273 | 5 | 383734.0 | 67648.0 | 4839843.0 | 6712.680999 | 305779.0 | 2626.022191 | 633.302358 | 1247.434119 | 8.276006 | 2197.646325 | 10477.0 | 860.0 | 0.176289 | 0.027303 | 0.002241 | 2025-01-01 |
| 241 | 2025 | 2 | 712 | 273 | 5 | 358818.0 | 69334.0 | 4890789.0 | 6869.085674 | 219491.0 | 2578.532303 | 489.320225 | 1420.228933 | 6.971910 | 2374.032303 | 2858.0 | 674.0 | 0.193229 | 0.007965 | 0.001878 | 2025-02-01 |
242 rows × 21 columns
plot_line(
data=monthly_stats,
x='date',
y='pct_delayed',
title='% of Flights Delayed (15+ min) Over Time',
ylabel='Percent Delayed'
)
- Delay rates increased steadily from 2005 to around 2008, likely due to rising air traffic and limited infrastructure.
- From 2008 to 2015, delay rates showed high volatility but no consistent trend, possibly due to airline mergers and operational adjustments.
- Clear seasonal patterns appear each year, with delays peaking in winter and summer months, indicating strong seasonality.
- A sharp decline in delay percentage occurred in early 2020 due to the COVID-19 pandemic and associated drop in flight volume.
- Post-2020, delay rates gradually returned with noticeable fluctuations, suggesting an uneven recovery and potential operational strain.
- The current (2023–2024) delay levels have reached pre-COVID variability, showing a full rebound in system usage and disruption patterns.
plot_line(
data=monthly_stats,
x='date',
y='avg_arr_delay',
title='Average Arrival Delay Over Time',
ylabel='Avg Delay (minutes)'
)
plt.figure(figsize=(14, 6))
delay_cols = {
'avg_carrier_delay': 'Carrier',
'avg_weather_delay': 'Weather',
'avg_nas_delay': 'NAS',
'avg_security_delay': 'Security',
'avg_late_aircraft_delay': 'Late Aircraft'
}
for col, label in delay_cols.items():
sns.lineplot(data=monthly_stats, x='date', y=col, label=label, linewidth=2.5)
plt.title('Average Delay by Cause Over Time', fontsize=18, fontweight='bold')
plt.ylabel('Delay (minutes)')
plt.xlabel('')
plt.legend(title='Delay Cause')
plt.xticks(rotation=45)
sns.despine()
plt.tight_layout()
plt.show()
- Late aircraft delays consistently cause the highest average monthly delay, making it the dominant contributor across all years.
- Carrier-related delays are the second most significant cause, closely following late aircraft delays in trend and seasonality.
- NAS (National Aviation System) delays also contribute significantly, with moderate seasonal fluctuations, and appear to increase again post-2020.
- Weather delays are relatively low but highly seasonal, with visible spikes likely aligned with winter storms or summer disruptions.
- Security delays are negligible across the entire timeline and show minimal variance, indicating they have little impact on total delays.
- A steep drop in all delay causes occurred in early 2020, reflecting the COVID-19 impact — followed by a sharp recovery and increased variability in 2022–2024.
- Post-COVID, late aircraft and carrier delays show more volatility and sharper spikes, suggesting growing operational strain or irregularities in flight schedules.
Time Series Analysis¶
We can now start to analyze the time series data using the filtered dataset.
Simple Moving Average¶
# Step 1: Aggregate total delay across all carriers and airports by month
monthly_overall_delay = (
df_clean.groupby(['year', 'month'])['arr_delay']
.sum()
.reset_index()
.sort_values(['year', 'month'])
)
# Step 2: Create a proper date column for sorting and plotting
monthly_overall_delay['date'] = pd.to_datetime(
monthly_overall_delay[['year', 'month']].assign(day=1)
)
# Compute 5-month SMA and residuals
monthly_overall_delay['SMA'] = monthly_overall_delay['arr_delay'].rolling(window=5).mean()
monthly_overall_delay['diff'] = monthly_overall_delay['arr_delay'] - monthly_overall_delay['SMA']
# Line chart: arr_delay vs SMA
fig1 = go.Figure()
fig1.add_trace(go.Scatter(
x=monthly_overall_delay['date'], y=monthly_overall_delay['arr_delay'],
mode='lines', name='Arrival Delay'
))
fig1.add_trace(go.Scatter(
x=monthly_overall_delay['date'], y=monthly_overall_delay['SMA'],
mode='lines', name='5-month SMA'
))
fig1.update_layout(
    title='Monthly Arrival Delay vs 5-Month SMA',
xaxis_title='Date',
yaxis_title='Total Arrival Delay (minutes)',
template='plotly_white',
height=500
)
fig1.show()
fig2 = px.histogram(
monthly_overall_delay,
x='diff',
nbins=30,
title='Distribution of Delay Residuals (Actual - SMA)',
labels={'diff': 'Residual (minutes)'},
template='plotly_white'
)
fig2.update_layout(
xaxis_title='Delay Residual',
yaxis_title='Frequency',
bargap=0.05
)
fig2.show()
The histogram shows that most monthly residuals fall within ±1 million minutes, so we use that range as a simple prediction band below.
monthly_overall_delay['upper'] = monthly_overall_delay['SMA'] + 1000000
monthly_overall_delay['lower'] = monthly_overall_delay['SMA'] - 1000000
fig = go.Figure()
# Actual data points
fig.add_trace(go.Scatter(
x=monthly_overall_delay['date'],
y=monthly_overall_delay['arr_delay'],
mode='markers',
marker=dict(size=4, color='green'),
name='Actual'
))
# SMA line
fig.add_trace(go.Scatter(
x=monthly_overall_delay['date'],
y=monthly_overall_delay['SMA'],
mode='lines',
line=dict(color='blue'),
name='SMA'
))
# Upper bound (invisible, used to create shaded area)
fig.add_trace(go.Scatter(
x=monthly_overall_delay['date'],
y=monthly_overall_delay['upper'],
line=dict(width=0),
showlegend=False,
hoverinfo='skip'
))
# Lower bound with fill to upper bound
fig.add_trace(go.Scatter(
x=monthly_overall_delay['date'],
y=monthly_overall_delay['lower'],
fill='tonexty',
fillcolor='rgba(255, 0, 0, 0.2)',
line=dict(width=0),
name='Prediction Interval',
hoverinfo='skip'
))
fig.update_layout(
title="Arrival Delay with SMA and Prediction Interval",
xaxis_title="Date",
yaxis_title="Total Arrival Delay (minutes)",
template='plotly_white',
height=500
)
fig.show()
Most of the 'anomalies' (points outside the interval) occur after the onset of COVID-19. This is unsurprising: the aviation industry is still recovering from the pandemic, and flight schedules remain irregular.
Performance Evaluation¶
valid_bounds = monthly_overall_delay.dropna(subset=['arr_delay', 'lower', 'upper'])
# Calculate how many points are outside the bounds
outside_bounds = valid_bounds[
(valid_bounds['arr_delay'] < valid_bounds['lower']) |
(valid_bounds['arr_delay'] > valid_bounds['upper'])
]
# Output the results
n_total = len(valid_bounds)
n_outside = len(outside_bounds)
pct_outside = n_outside / n_total * 100
print(f"Total valid rows: {n_total}")
print(f"Rows outside SMA bounds: {n_outside} ({pct_outside:.2f}%)")
valid_sma = monthly_overall_delay.dropna(subset=['arr_delay', 'SMA'])
mape_sma = mean_absolute_percentage_error(valid_sma['arr_delay'], valid_sma['SMA'])
mse_sma = mean_squared_error(valid_sma['arr_delay'], valid_sma['SMA'])
print(f"MAPE (SMA model): {mape_sma*100:.2f}%")
print(f"MSE (SMA model): {mse_sma:,.0f}")
Total valid rows: 238
Rows outside SMA bounds: 74 (31.09%)
MAPE (SMA model): 28.66%
MSE (SMA model): 1,088,855,420,221
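For reference, the two error metrics used throughout can be written out directly (a minimal sketch matching scikit-learn's definitions of `mean_absolute_percentage_error` and `mean_squared_error`):

```python
def mape(actual, predicted):
    # mean(|y - yhat| / |y|); scikit-learn returns a fraction, not a percent
    return sum(abs(y - p) / abs(y) for y, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    # mean of squared errors
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)

actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(f"MAPE: {mape(actual, predicted) * 100:.2f}%")  # same reporting style as above
print(f"MSE: {mse(actual, predicted):,.0f}")
```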
Exponential Smoothing¶
# Fit Exponential Smoothing model
ema_model = SimpleExpSmoothing(monthly_overall_delay['arr_delay']).fit(smoothing_level=0.2, optimized=False)
# Predict full in-sample values
monthly_overall_delay['EMA'] = ema_model.fittedvalues
monthly_overall_delay['diff'] = monthly_overall_delay['arr_delay'] - monthly_overall_delay['EMA']
# Forecast the next 3 months
forecast_3 = ema_model.forecast(3)
fig1 = go.Figure()
fig1.add_trace(go.Scatter(x=monthly_overall_delay['date'], y=monthly_overall_delay['arr_delay'],
mode='lines', name='Actual'))
fig1.add_trace(go.Scatter(x=monthly_overall_delay['date'], y=monthly_overall_delay['EMA'],
mode='lines', name='EMA (α=0.2)'))
fig1.update_layout(title="Actual vs EMA", xaxis_title="Date", yaxis_title="Arrival Delay", template="plotly_white")
fig1.show()
fig2 = px.histogram(monthly_overall_delay, x='diff', nbins=30, title='Distribution of Residuals (Actual - EMA)',
labels={'diff': 'Residuals'}, template='plotly_white')
fig2.update_layout(bargap=0.05)
fig2.show()
monthly_overall_delay['ema_upper'] = monthly_overall_delay['EMA'] + 1000000
monthly_overall_delay['ema_lower'] = monthly_overall_delay['EMA'] - 1000000
fig3 = go.Figure()
fig3.add_trace(go.Scatter(x=monthly_overall_delay['date'], y=monthly_overall_delay['arr_delay'],
mode='markers', name='Actual', marker=dict(size=4, color='green')))
fig3.add_trace(go.Scatter(x=monthly_overall_delay['date'], y=monthly_overall_delay['EMA'],
mode='lines', name='EMA', line=dict(color='blue')))
fig3.add_trace(go.Scatter(x=monthly_overall_delay['date'], y=monthly_overall_delay['ema_upper'],
line=dict(width=0), showlegend=False))
fig3.add_trace(go.Scatter(x=monthly_overall_delay['date'], y=monthly_overall_delay['ema_lower'],
fill='tonexty', fillcolor='rgba(255,0,0,0.2)', line=dict(width=0),
name='Prediction Interval'))
fig3.update_layout(title="Arrival Delay with EMA and Interval", template="plotly_white")
fig3.show()
As with the Simple Moving Average, if we treat forecasted delays within ±1M minutes of the actuals as normal, the Exponential Smoothing plot flags many post-COVID months as anomalies. A simple count of points outside the ±1M band also shows that Exponential Smoothing produces more anomalous delays than the Simple Moving Average (91 vs. 74).
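The smoothing recursion itself is short; a toy sketch follows, using one common convention (level initialized at the first observation, fitted values as one-step-ahead forecasts). Note that statsmodels' exact initialization may differ slightly.

```python
def exp_smoothing_fitted(values, alpha):
    """One-step-ahead fitted values for simple exponential smoothing:
    level_t = alpha * y_t + (1 - alpha) * level_{t-1}; the fitted value
    at t is the level carried in from t-1."""
    level = values[0]          # initialize the level at the first observation
    fitted = [level]           # fitted[0] equals the initial level
    for y in values[1:]:
        fitted.append(level)   # forecast for t uses the level before seeing y_t
        level = alpha * y + (1 - alpha) * level
    return fitted

y = [100.0, 120.0, 80.0, 140.0]
print(exp_smoothing_fitted(y, alpha=0.2))
```

With a small α such as 0.2, the level reacts slowly to new observations, which is why the EMA lags sharp post-COVID swings and flags more points as anomalies.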
Performance Evaluation¶
# Ensure no nulls in the relevant columns
valid_bounds = monthly_overall_delay.dropna(subset=['arr_delay', 'ema_lower', 'ema_upper'])
# Calculate how many points are outside the bounds
outside_bounds = valid_bounds[
(valid_bounds['arr_delay'] < valid_bounds['ema_lower']) |
(valid_bounds['arr_delay'] > valid_bounds['ema_upper'])
]
# Output the results
n_total = len(valid_bounds)
n_outside = len(outside_bounds)
pct_outside = n_outside / n_total * 100
print(f"Total valid rows: {n_total}")
print(f"Rows outside EMA bounds: {n_outside} ({pct_outside:.2f}%)")
# Drop rows where EMA is missing (should only be at the beginning if any)
valid_ema = monthly_overall_delay.dropna(subset=['arr_delay', 'EMA'])
# Calculate MAPE and MSE
mape_ema = mean_absolute_percentage_error(valid_ema['arr_delay'], valid_ema['EMA'])
mse_ema = mean_squared_error(valid_ema['arr_delay'], valid_ema['EMA'])
print(f"MAPE (EMA model): {mape_ema*100:.2f}%")
print(f"MSE (EMA model): {mse_ema:,.0f}")
Total valid rows: 242
Rows outside EMA bounds: 91 (37.60%)
MAPE (EMA model): 36.84%
MSE (EMA model): 1,465,865,459,636
Seasonal-Trend Decomposition (STD)¶
# Make sure date is datetime and sorted
monthly_overall_delay = monthly_overall_delay.sort_values('date')
monthly_overall_delay.set_index('date', inplace=True)
# Decompose with additive model
decomposition = sm.tsa.seasonal_decompose(monthly_overall_delay['arr_delay'], model='additive', period=12) # since our data is monthly, one full cycle is 12 months
# Extract components
trend = decomposition.trend
seasonal = decomposition.seasonal
resid = decomposition.resid
# Create Plotly subplots
fig = make_subplots(rows=3, cols=1, shared_xaxes=True,
vertical_spacing=0.03,
subplot_titles=("Trend", "Seasonality", "Residual"))
fig.add_trace(go.Scatter(x=trend.index, y=trend, mode='lines', name='Trend'), row=1, col=1)
fig.add_trace(go.Scatter(x=seasonal.index, y=seasonal, mode='lines', name='Seasonal'), row=2, col=1)
fig.add_trace(go.Scatter(x=resid.index, y=resid, mode='lines', name='Residual'), row=3, col=1)
fig.update_layout(
height=700,
title_text="Seasonal-Trend Decomposition (Additive)",
template="plotly_white",
showlegend=False
)
fig.show()
Below are some insights from the Seasonal-Trend Decomposition (STD) analysis:
- Trend shows a steady increase in total arrival delays from 2005 to 2019
- A sharp drop in delays is observed around 2020 due to the COVID-19 pandemic
- After 2020, delays rise again and surpass pre-pandemic levels by 2024
- Seasonality is consistent with clear annual cycles
- Peaks in delays occur mid-year, likely during summer travel months
- Troughs occur during winter months when travel is lower
- Residuals are mostly stable across time
- Large residual spikes appear since 2020, indicating abnormal conditions, i.e. Covid-19 impact
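The core idea behind the additive decomposition can be sketched without statsmodels (a toy example with a linear trend and a zero-sum seasonal pattern of period 4; `seasonal_decompose` uses an analogous centered moving average to estimate the trend):

```python
def centered_ma(values, period):
    """Centered moving average for an even period: a 2x(period)-MA with
    weights [0.5, 1, ..., 1, 0.5] / period. Undefined near the edges."""
    half = period // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        window = values[i - half : i + half + 1]
        s = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        out[i] = s / period
    return out

period = 4
trend = [10 + 2 * t for t in range(12)]   # linear trend
seasonal = [3, -1, -4, 2] * 3             # zero-sum seasonal cycle
y = [tr + s for tr, s in zip(trend, seasonal)]

est_trend = centered_ma(y, period)
# Because the seasonal pattern sums to zero over a full cycle, the centered
# moving average recovers the linear trend exactly at interior points
print([v for v in est_trend if v is not None])
```

The seasonal component is then estimated by averaging the detrended values (y minus trend) within each period position, and the residual is whatever remains.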
We now turn to the Prophet module to model the time series and forecast future delays. First, we add some additional features to the dataset to improve the model's accuracy.
Performance Evaluation¶
# Reconstruct fitted values
fitted = trend + seasonal
# Drop NaNs (common in trend edges)
eval_df = monthly_overall_delay[['arr_delay']].copy()
eval_df['fitted'] = fitted
eval_df = eval_df.dropna()
# Compute errors
mse = mean_squared_error(eval_df['arr_delay'], eval_df['fitted'])
mape = mean_absolute_percentage_error(eval_df['arr_delay'], eval_df['fitted'])
# Define static margin of ±1 million minutes
margin = 1_000_000 # same boundary as used in SMA/EMA
eval_df['upper'] = eval_df['fitted'] + margin
eval_df['lower'] = eval_df['fitted'] - margin
# Count how many points fall outside the bounds
outside_bounds = eval_df[
(eval_df['arr_delay'] > eval_df['upper']) |
(eval_df['arr_delay'] < eval_df['lower'])
]
n_total = len(eval_df)
n_outside = len(outside_bounds)
pct_outside = n_outside / n_total * 100
# Display results
print(f"STD model performance:")
print(f"- MAPE: {mape:.2%}")
print(f"- MSE: {mse:,.2f}")
print(f"- Points outside ±{margin:,} bounds: {n_outside} out of {n_total} ({pct_outside:.2f}%)")
STD model performance:
- MAPE: 21.00%
- MSE: 462,049,177,386.65
- Points outside ±1,000,000 bounds: 25 out of 230 (10.87%)
Data Wrangling: Add Features¶
1. Holidays per month¶
# Define the date range
start_date = '2005-01-01'
end_date = '2025-05-31'
# Create calendar and get holidays in that range
calendar = USFederalHolidayCalendar()
holidays = calendar.holidays(start=start_date, end=end_date)
# Convert to DataFrame with year and month
holidays_df = pd.DataFrame({'date': holidays})
holidays_df['year'] = holidays_df['date'].dt.year
holidays_df['month'] = holidays_df['date'].dt.month
# Count holidays by (year, month)
monthly_flags = (
holidays_df.groupby(['year', 'month'])
.size()
.reset_index(name='num_holidays')
)
# Fill in missing (year, month) pairs with 0 holidays
all_months = pd.MultiIndex.from_product(
[range(2005, 2026), range(1, 13)],
names=['year', 'month']
)
monthly_flags = (
monthly_flags
.set_index(['year', 'month'])
.reindex(all_months, fill_value=0)
.reset_index()
)
# Preview
print(monthly_flags.head(12)) # first year
    year  month  num_holidays
0   2005      1             1
1   2005      2             1
2   2005      3             0
3   2005      4             0
4   2005      5             1
5   2005      6             0
6   2005      7             1
7   2005      8             0
8   2005      9             1
9   2005     10             1
10  2005     11             2
11  2005     12             1
2. Holiday/travel season flag¶
Summer travel: June, July, August
Holiday travel: November (Thanksgiving), December (Christmas/New Year)
Source: TSA Throughput Data, DOT Bureau of Transportation Statistics
monthly_flags['travel_season_flag'] = monthly_flags['month'].isin([6, 7, 8, 11, 12]).astype(int)
# Preview
print(monthly_flags.head(12)) # first year
    year  month  num_holidays  travel_season_flag
0   2005      1             1                   0
1   2005      2             1                   0
2   2005      3             0                   0
3   2005      4             0                   0
4   2005      5             1                   0
5   2005      6             0                   1
6   2005      7             1                   1
7   2005      8             0                   1
8   2005      9             1                   0
9   2005     10             1                   0
10  2005     11             2                   1
11  2005     12             1                   1
3. Disaster flag¶
Flag for major disasters (e.g., hurricanes, wildfires) that could impact air travel
monthly_flags['date'] = pd.to_datetime(monthly_flags[['year', 'month']].assign(day=1))
# 1. COVID-19 disruption and extended recovery: March 2020 through May 2025
monthly_flags['covid_flag'] = (
(monthly_flags['date'] >= '2020-03-01') & (monthly_flags['date'] <= '2025-05-01')
).astype(int)
# 2. Recession (Dec 2007 – Jun 2009)
monthly_flags['recession_flag'] = (
(monthly_flags['date'] >= '2007-12-01') & (monthly_flags['date'] <= '2009-06-01')
).astype(int)
# 3. Fuel spike periods
fuel_spike_periods = [
('2008-04-01', '2008-10-01'),
('2012-02-01', '2012-04-01'),
('2022-03-01', '2022-07-01')
]
monthly_flags['fuel_spike_flag'] = 0
for start, end in fuel_spike_periods:
monthly_flags.loc[
(monthly_flags['date'] >= start) & (monthly_flags['date'] <= end),
'fuel_spike_flag'
] = 1
# 4. Natural disaster months
natural_disaster_months = [
'2005-08', # Hurricane Katrina
'2008-09', # Hurricane Ike
'2011-08', # Hurricane Irene
'2012-10', # Hurricane Sandy
'2017-08', # Hurricane Harvey
'2017-09', # Hurricane Irma
'2018-09', # Hurricane Florence
'2018-10', # Hurricane Michael
'2020-08', # Hurricane Laura
'2021-02', # Texas snowstorm
'2021-08', # Hurricane Ida
'2022-09', # Hurricane Ian
'2023-08', # Hawaii wildfires / Hurricane Idalia
'2024-09', # Hurricane Helene
'2025-03', # tornado outbreak
'2025-06' # Southern California wildfires
]
monthly_flags['natural_disaster_flag'] = (
monthly_flags['date'].dt.strftime('%Y-%m').isin(natural_disaster_months)
).astype(int)
# Drop date column if not needed
monthly_flags.drop(columns='date', inplace=True)
# Preview
print(monthly_flags.head())
   year  month  num_holidays  travel_season_flag  covid_flag  recession_flag  \
0  2005      1             1                   0           0               0
1  2005      2             1                   0           0               0
2  2005      3             0                   0           0               0
3  2005      4             0                   0           0               0
4  2005      5             1                   0           0               0

   fuel_spike_flag  natural_disaster_flag
0                0                      0
1                0                      0
2                0                      0
3                0                      0
4                0                      0
Prophet¶
# Step 1: Prepare data for Prophet
# Prophet expects columns named 'ds' for date and 'y' for target
prophet_df = monthly_overall_delay.reset_index()[['date', 'arr_delay']].rename(columns={
'date': 'ds',
'arr_delay': 'y'
})
# Step 2: Fit Prophet model
model = Prophet(
yearly_seasonality=True,
daily_seasonality=False,
weekly_seasonality=False
)
model.fit(prophet_df)
# Step 3: Create future dataframe (6 months ahead)
future = model.make_future_dataframe(periods=6, freq='MS') # 'MS' = month start
forecast = model.predict(future)
# Step 4: Plot forecast (matplotlib)
fig1 = model.plot(forecast)
# Step 5: Plot components (trend, seasonality, etc.)
fig2 = model.plot_components(forecast)
21:47:53 - cmdstanpy - INFO - Chain [1] start processing
21:47:53 - cmdstanpy - INFO - Chain [1] done processing
The trend component shows a steady increase in arrival delays from 2005 onward, with a noticeable acceleration after 2020.
The seasonality component reveals a strong yearly cycle, with delays typically peaking around July and dipping sharply in November, consistent with summer travel surges and reduced traffic during late fall.
Overall, Prophet captures both the long-term growth in delays and recurring annual patterns, supporting the presence of consistent seasonal effects in flight delay behavior.
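Prophet models yearly seasonality as a sum of Fourier terms, s(t) = Σ (aₙ sin(2πnt/P) + bₙ cos(2πnt/P)). For a signal composed of such terms, the coefficients can be recovered by simple projection; the sketch below illustrates the idea on toy monthly data (this is not Prophet's actual estimator, which fits the full model via Stan):

```python
import math

def fourier_coeffs(y, period, n):
    """Project y onto the n-th Fourier pair of the given period.
    Exact when len(y) is a multiple of the period."""
    N = len(y)
    a = 2 / N * sum(v * math.sin(2 * math.pi * n * t / period) for t, v in enumerate(y))
    b = 2 / N * sum(v * math.cos(2 * math.pi * n * t / period) for t, v in enumerate(y))
    return a, b

# Toy monthly seasonal signal: amplitude 2.0 on sine, 1.5 on cosine
period = 12
y = [2.0 * math.sin(2 * math.pi * t / 12) + 1.5 * math.cos(2 * math.pi * t / 12)
     for t in range(48)]

a, b = fourier_coeffs(y, period, n=1)
print(round(a, 6), round(b, 6))  # recovers the amplitudes 2.0 and 1.5
```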
Performance evaluation¶
# Merge actuals and forecast on 'ds'
eval_df = forecast.merge(prophet_df[['ds', 'y']], on='ds', how='left')
# Drop rows with missing actuals (only keep rows within training period)
eval_df = eval_df.dropna(subset=['y'])
# Count how many actual points fall outside the prediction interval
outside_bounds = eval_df[
(eval_df['y'] < eval_df['yhat_lower']) |
(eval_df['y'] > eval_df['yhat_upper'])
]
# Summary stats
n_total = len(eval_df)
n_outside = len(outside_bounds)
pct_outside = n_outside / n_total * 100
print(f"Total points evaluated: {n_total}")
print(f"Points outside prediction interval: {n_outside} ({pct_outside:.2f}%)")
# Merge actuals and predictions on date
eval_df = prophet_df[['ds', 'y']].merge(
forecast[['ds', 'yhat']], on='ds', how='left'
)
# Drop rows with missing predictions (e.g., if yhat doesn't cover all ds)
eval_df = eval_df.dropna(subset=['y', 'yhat'])
# Calculate metrics
mape = mean_absolute_percentage_error(eval_df['y'], eval_df['yhat'])
mse = mean_squared_error(eval_df['y'], eval_df['yhat'])
print(f"MAPE: {mape:.4f}")
print(f"MSE: {mse:,.0f}")
Total points evaluated: 242
Points outside prediction interval: 34 (14.05%)
MAPE: 0.3940
MSE: 1,147,133,606,165
Prophet with Additional Features¶
prophet_df = monthly_overall_delay.reset_index().merge(
monthly_flags,
on=['year', 'month'],
how='left'
)
prophet_df = prophet_df.rename(columns={
'date': 'ds',
'arr_delay': 'y'
})[['ds', 'y', 'num_holidays', 'travel_season_flag', 'covid_flag',
'recession_flag', 'fuel_spike_flag', 'natural_disaster_flag']]
model = Prophet(yearly_seasonality=True, daily_seasonality=False, weekly_seasonality=False)
# Add all external regressors
model.add_regressor('num_holidays')
model.add_regressor('travel_season_flag')
model.add_regressor('covid_flag')
model.add_regressor('recession_flag')
model.add_regressor('fuel_spike_flag')
model.add_regressor('natural_disaster_flag')
model.fit(prophet_df)
future = model.make_future_dataframe(periods=6, freq='MS')
future = future.merge(monthly_flags,
left_on=[future.ds.dt.year, future.ds.dt.month],
right_on=['year', 'month'],
how='left')
future = future[['ds', 'num_holidays', 'travel_season_flag', 'covid_flag',
'recession_flag', 'fuel_spike_flag', 'natural_disaster_flag']]
forecast = model.predict(future)
plot_plotly(model, forecast)
21:47:54 - cmdstanpy - INFO - Chain [1] start processing
21:47:54 - cmdstanpy - INFO - Chain [1] done processing
plot_components_plotly(model, forecast)
Performance Evaluation¶
# Merge actuals and forecast on 'ds'
eval_df = forecast.merge(prophet_df[['ds', 'y']], on='ds', how='left')
# Drop rows with missing actuals (only keep rows within training period)
eval_df = eval_df.dropna(subset=['y'])
# Count how many actual points fall outside the prediction interval
outside_bounds = eval_df[
(eval_df['y'] < eval_df['yhat_lower']) |
(eval_df['y'] > eval_df['yhat_upper'])
]
# Summary stats
n_total = len(eval_df)
n_outside = len(outside_bounds)
pct_outside = n_outside / n_total * 100
print(f"Total points evaluated: {n_total}")
print(f"Points outside prediction interval: {n_outside} ({pct_outside:.2f}%)")
# Merge actuals and predictions on date
eval_df = prophet_df[['ds', 'y']].merge(
forecast[['ds', 'yhat']], on='ds', how='left'
)
# Drop rows with missing predictions (e.g., if yhat doesn't cover all ds)
eval_df = eval_df.dropna(subset=['y', 'yhat'])
# Calculate metrics
mape = mean_absolute_percentage_error(eval_df['y'], eval_df['yhat'])
mse = mean_squared_error(eval_df['y'], eval_df['yhat'])
print(f"MAPE: {mape:.4f}")
print(f"MSE: {mse:,.0f}")
Total points evaluated: 242
Points outside prediction interval: 44 (18.18%)
MAPE: 0.2812
MSE: 710,579,125,884
Conclusion¶
Model Performance Summary
| Model | Valid Points | Outside Boundaries (%) | MAPE (%) | MSE (×10¹²) |
|---|---|---|---|---|
| Simple Moving Average | 238 | 74 (31.09%) | 28.66% | 1.089 |
| Exponential Moving Avg | 242 | 91 (37.60%) | 36.84% | 1.466 |
| Seasonal-Trend Decomp. | 230 | 25 (10.87%) | 21.00% | 0.462 |
| Prophet (baseline) | 242 | 34 (14.05%) | 39.40% | 1.147 |
| Prophet (+ features) | 242 | 44 (18.18%) | 28.12% | 0.711 |
Across the evaluated time series models, the Seasonal-Trend Decomposition (STD) model emerged as the strongest performer, achieving the lowest MSE and MAPE while maintaining tight prediction bounds (only 10.87% of points fell outside the interval). The Simple Moving Average (SMA) model performed moderately but lacked responsiveness to trends, while Exponential Moving Average (EMA) showed high error and the widest prediction gaps. The Prophet model improved considerably when enhanced with relevant external features, reducing its MSE by ~38% and MAPE by over 11 percentage points, though it still exhibited a higher percentage of boundary violations. These results highlight that incorporating domain-specific seasonality and structure (as in STD) or contextual features (as in Prophet+) can significantly improve forecasting accuracy.
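The improvement from adding regressors can be verified directly from the summary table above (a quick arithmetic check):

```python
# MSE and MAPE from the two Prophet rows of the summary table
mse_baseline, mse_features = 1.147e12, 0.711e12
mape_baseline, mape_features = 39.40, 28.12

mse_reduction = (mse_baseline - mse_features) / mse_baseline
mape_drop_pp = mape_baseline - mape_features

print(f"MSE reduction: {mse_reduction:.1%}")                # ~38%
print(f"MAPE drop: {mape_drop_pp:.2f} percentage points")   # > 11 pp
```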
All models struggled to fit post-COVID-19 data, which is understandable: the aviation industry is still recovering from the pandemic, and flight schedules remain irregular.